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Abstract 

Quantile regression is studied in combination with a penalty which promotes structured 
(or group) sparsity. A mixed ^i^oo-norm on the parameter vector is used to impose structured 
sparsity on the traditional quantile regression problem. An algorithm is derived to calculate 
the piece-wise linear solution path of the corresponding minimization problem. A Matlab 
implementation of the proposed algorithm is provided and some applications of the methods 
are also studied. 

Keywords: quantile regression; structured sparsity; variable selection; convex optimiza- 
tion 

1 Introduction 

As [T] have remarked, a good statistical model has seven key properties: 1. Parsimony, 2. 
Tractability, 3. Conceptual insightfulness, 4. Generalizability, 5. Falsifiability, 6. Empirical 
consistency, and 7. Predictive precision. In this paper a structured sparse quantile regression 
model is studied and an efficient algorithm is proposed to solve the corresponding minimization 
problem. As an illustration, two applications are discussed where such a model is preferable 
(in the sense of several of the above properties) to others, and where the proposed algorithm is 
more useful than others. 

The quantile regression model of [2] allows for studying the effect of explanatory variables 
on the entire conditional distribution of the response variable, and not only on its center. In this 
sense quantile regression provides a deeper conceptual insightfulness into data than least squares 
models. Since parsimony is a key property of a good model, variable selection techniques have 
attracted a lot of attention in recent statistical literature, [3]. One reason that sparse models have 
become popular is the availability of very large data sets; see e.g. [Ij. A promising approach 
to achieve a sparse model is to penalize models with a sparsity promoting penalty. Among 
different penalties, the £i-norm is the most popular one. It has e.g. been used in combination 
with least squares in Lasso regression of [5], and an effective algorithm was proposed by [6] 
and [7|. Authors such as [8] and [9| have proposed similar algorithms for the ^i-norm penalized 
quantile regression problem. 

Although the ^i-norm penalty does a good job in selecting individual variables, there is 
no control on the groups of variables it selects. There are cases where one is interested in 
selecting a group of variables instead of individual variables. In some cases it is inevitable, 
see [lOj. Consider e.g. race as a predictor with three levels: black, white, and other. The 
standard approach is making two dummy variables {Xi,X2) out of it: (0,0) for other, (1,0) for 
black, and (0, 1) for white. Obviously in such a case one factor is represented by two variables. 
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Therefore, one may select the pair (Xi, X2) together, or select none of them. The usual £i-norm 
penalty cannot guarantee such behavior. Finding an alternative to the £i-norm seems therefore 
to be necessary in the cases where groups of variables should be selected instead of individual 
variables. As an individual can be considered as a singleton group, the grouping approach should 
be a generalization of the usual £i-norm penalty. 

An appropriate penalty for the grouping case would be a mixed norm which applies £i-norm 
penalty to groups, in order to promote sparsity of groups of variables. Within each group the 
£oo-norm (also known as the max-norm) can be used to ensure that, once a single member of 
a group is chosen (is nonzero), all other members of the same group can grow to the same size 
without penalty. Other choices are possible too, but we restrict ourselves to this mixed £1 — £00- 
norm (or ^i^oo-norm for short) choice. [TT] have introduced such a group sparsity penalty, and 
have solved the problem for the least squares model and have applied it to a low birth weight 
data set. 

In this paper we formulate an algorithm for ^i^co-norm penalized quantile regression instead 
of least squares regression. As tractability is a key property of a good model, we choose the 
mixed norm penalty function because the corresponding models have a simple dependence on 
the penalty parameter. Choosing the ^00-norm within groups guarantees that the model is a 
piece-wise linear function both of the penalty parameter and of the ^i^oo-norm of the model. 
Other choices (such as e.g. £1^2) would lead to less tractable optimization problems. The main 
contribution of the present paper is the description of an efficient computational algorithm for 
solving this optimization problem. This algorithm applies only to the case of non-overlapping 
groups. A Matlab implementation of the algorithm is provided on the authors' webpage [12| . 

In order to illustrate the performance of the proposed penalized model and the corresponding 
optimization algorithm, we study two main applications: Firstly the use of qualitative explana- 
tory variables with more than two levels, and secondly, simultaneous variable selection for a 
vector of response variables. 

For the first application, we analyze a data set pertaining to 'low birth weight' (LBW). 
Low birth weight is a common subject in quantile regression literature (see e.g. P-3J). If one is 
interested in the effect of different variables on the lower tail of the conditional distribution of the 
infant's weight, least squares models are not appropriate, because they analyze the effect of the 
explanatory variables on the conditional mean of the response variable. On the other hand, using 
quantile regression, one can study lower conditional quantiles of the response. Due to presence 
of a qualitative variable with more than two levels in the model, the sparsity promoting ^i-norm 
penalty is not appropriate either. Therefore, a penalized model with a mixed norm penalty 
seems to be more appropriate for selecting effective variables and estimating their effect. 

For the second application, the '93CARS' data set of [H] is studied. It consists of 14 
explanatory variables and 5 response variables which are highly correlated. |15] have suggested 
that a simultaneous variable selection approach is more interesting for modeling a vector of 
correlated response variables with a common set of explanatory variables. Using the proposed 
mixed norm as penalty one can put the coefficients of each explanatory variable for all the 
response variables in a group and perform a simultaneous variable selection. While [T5] worked 
it out in the least-squares framework, we use quantile regression which is more robust and makes 
it possible to study the whole conditional distribution and not just its center. [16] have studied 
the latter problem too, using a standard linear programme solver. In contrast, our proposed 
algorithm produces the whole piece-wise linear solution path (for different values of the penalty 
parameter) which makes the variable selection computationally more efficient. 

Section [2] states the structured sparse quantile regression problem and introduces notation. 
The main tools for describing the corresponding minimization problem are discussed in Sec- 
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tion [31 The algorithm to solve ^i^co-norm penalized quantile regression problem is introduced 
and discussed in Section [H Section illustrates the use of structured quantile regression for the 
analysis of a low birth weight data set. A simultaneous variable selection problem for a vector 
of response variables data set is also examined. The paper is concluded in Section [6l 



2 Problem statement 

In this paper, X„xm = (^i; • • • ,^m) is the design matrix containing the explanatory variables. 
The linear model which we study is: 

y = X(3 + e (1) 
The loss function which quantile regression (see [2j) tries to minimize is given by: 

n 

Y,Qr{{y-xp\) (2) 



4 = 1 

where the function Qt is defined as: 

Or{t) = [ _2ii-r)t ift<0, 

for t € M and < r < 1. For r = 1/2, one recovers Qrit) = \t\. This loss function is more robust 
to outliers than the usual quadratic one: the minimizer of the least squares loss is the mean, 
while the minimizer of the least absolute deviations loss (expression ([2]) with r = 1/2) is the 
median. The optimization problem encountered in quantile regression thus is: 

n 

/3qr = arg min ^Qr{{y- X/3)i) . (4) 
^ i=i 

The so-called ^i-norm = YlY=i ^^i^ known to promote sparsity [5] when used 

as a penalty (added to a loss function) or as a constraint (imposed on the minimizer of a loss 
function). Two major examples are the Lasso of [5] (which combines a quadratic loss with an 
£i-norm penalty or constraint) and sparse quantile regression of e.g. [8] (which combines the 
loss dl]) with an ^i-norm penalty. See also f9|). The advantage of using the ^i-norm instead of 
the number of nonzero /3j is computational: the former leads to a convex minimization problem 
while the latter gives rise to a combinatorial minimization problem. 

In this paper our goal is to impose structure between the m explanatory variables /3i, . . . (3m 
by dividing them into g non-overlapping groups {g < m). Gi,...,Gg represent the indices 
in each group. At the same time, we desire to select a small number of active groups. We 
are therefore interested in imposing a penalty or constraint of ^i-norm type on the different 
groups. Within a given group, when a single member is active (nonzero), we allow the other 
members of that group to reach the same magnitude without penalty. In order to impose such a 
behavior within groups, the max-norm is appropriate. Indeed, the value of maxjgG'j.{|/3j |}does 
not increase as long as the largest of the \f3j\ (with j G Gk) does not increase, regardless of the 
actual size of the smaller ones. 

The mixed norm ||/3||i,oo is defined as 

||/3||i,oo = ^||/3gJ|oo, (5) 
k=l 
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where /Sgj. are the components of /? in group k, and ||/3Gfc||oo = maxjgGj. \l3j\. This mixed norm 
behaves as described above: Within a group, it behaves as a max-norm and between groups it 
imposes an ^i-norm (sum of max-norms). The structured sparse /3 for the quantile regression 
loss function in ([3|) is now defined as the minimizer: 

n 

^ = argminY^Qr {{v - X(3),) + A|l/3||i,oc (6) 
1=1 

where A > is the penalization parameter or as the minimizer: 

n 

j3 = arg min /S^Qr iiv - X(3)i) , (7) 

1=1 

where i? is a nonnegative parameter. By suitably choosing the parameters A and R, the min- 
imizers of these problems are identical. We therefore use the same symbol /3, both for the 
minimizer of the penalized problem ([6]) and for the minimizer of constrained problem d?]). We 
do not indicate explicitly their dependence on A or R. 

The mixed norm penalty ||/3||i,oo has been studied before in the framework of least squares 
loss functions by [TT| . A special case of that least squares problem was already introduced by [15] . 
Here, as in [16], we combine its structure-imposing properties with the robustness properties of 
quantile regression. The aim of this paper is to present an efficient algorithm for the solution of 
the minimization problems ([6|) and ([7]), for various values of A and R, and to illustrate its use 
with some applications. 



3 Solution of the minimization problem 

Due to the presence of the non-smooth functions Qr and ||/3||i,oo) the optimization problems ([6]) 
and d?]) are not differentiable. They are however convex minimization problems for which a 
general theory exists. We refer to e.g. [T7] for an introduction to convex analysis. In particular, 
for any convex function /, the symbol df denotes the subdifferential of /. 

We express the condition for minimizing the cost function of expression ([6]) using subdif- 
ferentials instead of usual derivatives. Necessary and sufficient conditions for optimality of the 
optimization problem ([6|) are found by expressing that belongs to the subdifferential of the 
functional Therefore, we will have that /3 is the minimizer of ([6]) if and only if there exists 
a vector w S d"^^^^ Qr {{y — Xj3)i) and a vector u G 9||/3||i^oo such that: 

- X^w + Att = 0. (8) 

Setting r = y — Xf3, the equations that we need to solve are: 

' -X^w + Xu = 

Wi G dQr {Vi) 

(9) 

n G 5||/3||i,oo 
^r = y-X(3. 

Solving the minimization problems ([6]) and ([7]) therefore requires detailed knowledge of the 
subdifferentials of Qr and of ||/3||i^oo- 
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In case of the function q^- the subdifferential is equal to: 



2t if n > 0, 

dQriri) = { [-2(1 - t),2t] if n = 0, (10) 
-2(1 - r) if ri < 0. 

It is important to remark that, conversely, the knowledge of the value of the subgradient Wi G 
dQr{i^i) also gives a certain knowledge on rj. In particular, if Wi = 2t, then rj must be non- 
negative, if Wi = —2(1 — r), then rj must be non-positive and if Wi belongs to the interval 
] — 2(1 — t),2t[, then rj must be zero. This will be important when describing the algorithm 
that solves the equations Q. 

The subdifferential 5||/5||i_oo for the penalty part ||/3||i,oo is more difficult. As the groups are 
disjoint, each term in the sum ([5]) can be handled separately. Consider group k with ||/3Gfe||oo = 
maxjgGj. \l3j\, and Gk = {ji,j2, ■ ■ ■ ,jik}- The subdifferential of the max function is the convex 
hull of the union of the subdifferentials of the 'maximal' arguments (see e.g. [18j). In our case, 
each argument of the max-function is a function of just a single variable f3j. The subdifferential 
(w.r.t. the single variable f3j) of the absolute value \/3j\ is: 



1 if > 0, 
-1,1] if /3j = 0, (11) 
-1 if /3j < 0, 



and the subdifferential of |/3j| w.r.t. f3j is zero {j / J). 

In this way, when all components of (3 in group k are zero, we find that i9||/3gj. ||oo consists 
of an ^i-ball of radius 1. On the other hand, when all 1^ coefficients in group Gk have the 
same (nonzero) absolute size, 9||/3Gfe||oo consists of a single face of an -£i-ball with radius 1 (the 
signs of the /3j determine which face). If the maximum is reached in 1,2, ... — 1 nonzero f3j 
then the set i9|i/3Gfc |!oo (with elements u) consists of that part of the same face with uj = 
for non-maximal components /3j (i.e. components that are smaller than the maximum of that 
group: < ||/3gJ|oo)- 

Here too, knowledge of £ 9\\/3g,. \\oo yields partial information on fSck- E.g. if \\ug^ ||i < 1 
then all in this group are zero; if ||uGfe|| = 1 and all uj ^ then all /3j in that group have 
the same absolute value. If Uj = then /3j may not be maximal in its group. 

This type of interplay between /3, u, w and r will be used frequently in the algorithm which 
we describe in Section 21 The optimality conditions ([9]) are not analytically solvable for /3. 
However, the problems ([6]) and ([7]) fall into the class of problems described by [19] which allow 
for a piece-wise linear solution path. This means that the minimizers of ([6]) and ^ are piecewise 
linear functions in terms of A or R. Moreover the nodes that determine these piecewise linear 
functions can be calculated analytically (i.e. using linear algebra), provided one starts the 
calculation from R = (which corresponds to /3 = and A large) and proceeds carefully for 
increasing values of R (decreasing values of A). As explained in the next section, a new node of 
the minimizer /3 (as a function of R) appears e.g. when a new group becomes active (becomes 
nonzero) or when an equation among the (y — Xf3)i = is satisfied. There are several more 
events like this possible and they are explained in more detail in the next section. 

4 Algorithm 

The non-iterative algorithm we propose for finding the minimizers of ^ and ([TD is described in 
this section. It is similar to the LARS algorithm of [2J for the Lasso and the algorithm of [8j and 
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[9] for sparse quantile regression (where the number of groups equals the number of explanatory 
variables). In our case, we are not restricted to singleton groups; however, similar to [TI], only 
non-overlapping groups are allowed. 

Following [9], we set u = u^^^ + u^^'^ /X and w = w^^^ + \w^^\ where u^^\u^^\w^^^ and w^^^ 
do not depend on A. In this way equation ^ is equivalent to: 

-XV)+n«=0 and _ X V) + = q. (12) 

As the minimizer /3 of the problem ([7|) is a piecewise linear function of i? = ||/3||i^oo; it is 
characterized by a set of (interpolation) nodes. If the value of f3 is known in these nodes, then 
/3 can also be found for other values of R by linear interpolation. 

The proposed algorithm starts from /3 = for A sufficiently large (A > Amax)- As /3 = 0, one 
finds the value of r = y — X/3 = y and of w: wf'^ = 2r if rj > 0, wf'^ = —2(1 — r) if < 
(assuming that all coefficients of r at this stage are nonzero) and w^^^ = 0. Using equations (|12p . 
one then calculates u = u^'^^ + n'^^^/A. The value of Amax is now found by solving the equations 
= 1 (for all groups k : 1 . . . g). This can be done on a computer as |ju*^'^^ +^i*-^V-^lli,oo is a 
piecewise linear function of A~^. The largest positive value among those numbers is the desired 
value of Amax- The group k for which \\ugi. || = 1 will be the first group to enter the set of active 
groups (in the next step) and become non-zero. Once this break point value of A is known, it 
can be used to update u and w. 

The algorithm then continues a loop with consists of several parts: 

1. Express ^5 as a linear function of R: /3 = (3^^^ + RI3^^\ where R = ||/3||i,oo- This uses the 
knowledge of the active groups, of the maximal set within each group (from the knowl- 
edge of the subgradient u), of the relative signs of these components of /3 (also from the 
knowledge of u), and of the components of y — X(3 that are zero (from the knowledge of 
the subgradient w). 

2. Determine the largest value of R for which the expression P = /3(o) + is valid. The 
expression may cease to be valid when: 

(a) an additional component of r = y — Xf3 becomes zero, 

(b) an active group becomes non-active (all members are zero), 

(c) a non-maximal component of /3 (in some group) becomes equal in absolute value to 
the maximal value in that group. 

3. Once the value of R is calculated, update the variables u and w (as a function of A). 
Here equations ()12p are used together with the knowledge of Wi for the nonzero r^. The 
knowledge of the non-maximal (3j (in each group) is also used to set some Uj's to zero. 

4. Calculate the smallest value of A for which these expressions for u and w are valid. The 
expressions for u and w cease to be valid when: 

(a) One of the \\ugj. \ \i will reach 1 (a new active group will be added to the active set in 
the next step), 

(b) A coefficient of w equals 2r or —2(1 — r) (in this case, an equation Tj = that is 
satisfied in the current step, will no longer be satisfied in the next step), 

(c) in an active group k, one of the coefficients of uq^. becomes equal to 0. In this case, 
the corresponding component of (3 will be of smaller absolute value than the maximal 
value in that group in the next step. 
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5. Continue with step 1 or stop. 



The algorithm may be stopped when the desired maximum number of active groups is reached, 
when A = 0, or when some other suitable stopping criteria is satisfied. 

In this algorithm, steps 1 and 3 require the solution of a linear system of equations. Steps 
2 and 4 require the solution of simple linear equations to determine i? or A at break points. 
Here numerical round-off error may affect the accuracy of these calculations. Unfortunately, 
the decisions (groups entering or leaving the active set, coefficients becoming submaximal in a 
group, . . . ) depend on these numerical results. Round-off errors may therefore lead to the wrong 
decisions being taken by the algorithm. In that case the algorithm fails. This shortcoming is 
common to all the algorithms of this type [UEIIS]. 

When dealing with data X and y containing small integers, or when rows or columns of 
y and X repeat, it is possible that different events (2a-c or 4a-c) occur simultaneously. One 
could e.g. have two groups enter the active set at the same step. Another possibility is that 
a new group becomes active at the same step when a component of an active group becoming 
sub-maximal. Such possibilities are not accounted for in the current implementation of the 
algorithm. This "one-at-a-time condition" [71 p417] is also common to algorithms of this type 
(the work of [3 [51 [9] also does not handle such cases). [71 p438] have proposed to add a small 
amount of jitter to the variables to overcome the problem. 

The most effective way of understanding the proposed algorithm is by going through a 
worked-out example step-by-step. Table [1] lists the complete solution path of a simple example 
with: 

/-4 3 5 \ / 8 \ 

X=[-A 5 1 1 , y=[ 7 1 , (13) 

groups Gi = {1} and G2 = {2,3} and r = 1/2. The solution path is given as a function of R 
and A, and the intermediate values of the subgradients u and w are also given. This example 
was chosen in such a way that every possibility in steps 2 and 4 of the algorithm occurs at least 
once. 

As one can see in the example, there are certain values of A (i.e. A = 17, 37/5, 20/3, . . .) for 
which the minimizer of the penalized problem ^ is not unique. One also sees that between 
these special values of A the minimizer /3 of ([6|) is constant as a function of A (the subgradients 
u and w do change). This behavior is easy to interpret by plotting the loss Yll=i Qr{{y — 
as a function of i? = ||;5||i^oo5 as was done for example ([T3D in Figure [H We see that the graph 
is piecewise linear and that A is locally equal to the slope of this trade-off curve. Therefore, 
between break points, A is constant (several /3 correspond to the same value of A) and at break 
points, A takes on several values (in other words, for several values of A, the solution /3 of Q is 
constant). 

A set of Matlab functions that implement the above algorithm was written by the authors, 
and is available on their web page [12j . 



5 Applications 

The algorithm presented in Section U) gives the entire solution path for different values of R = 
||/3||i^oo in some interval [0,i2max]- Selecting the appropriate value of R (and the corresponding 
coefficients (3) is an important issue for practical purposes. As [20] have proposed, the Bayesian 
information criterion (BIG) of |i21j is a promising information criterion for model selection in 
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# /3 r R X u w Step 

(0,0,0) (8,7,-11) [17,+(X)[ A-i(-12,ll,6) (1,1,-1) 



{0,R,R) ': [0,1] 17 i^S^Tf) (1>1>-1) 1 

(0,1,1) (0,1,-8) 1 [f,i7] ; : 3 

27i 37 /-36 T r\\ /-I 



i2a 

f (^,1,0) (^,1,-1) l|'' 



(0 27 19^1 /Q Q ^161-) 27 r20 37i : : o 

V"' 22 ' 22 / V"' 22 / 22 L 3 ' 5 J • • '-' 

• • [27 31 20 ^ -I) 1 

I22' 2i 3 I ^5 ^i"; 6 ' 6' ^ l„ 

(0,0,^) I [3,f] ; ; 3 

: : [§,2] 3 (-1,t^,T3) (!>-!> "1) 1 

(-2,0,0) (0,-1,-3) 2 [2,3] : : 3 



■■ [2,1] 2 (-1,^,1) (i-1,-1) 1 

(=M =11) (Q Q =23^^ 25 r 20 21 : : Q 

V 12 ' 6 ' 6/ V"'"' 6 12 Ll9'^J • ■ '-' 

• : r25 2291 20 (_-^ (J_ ^ 1 

Ll2' 40 i 19 V ' ^^38' 38 ' / 

(0,0,0) ^ [0' fi] (-1,-1,0) '^(^'"S^'^) 3 



^4a 
i2c 

;2b 

^4a 



;2a 
;2a 



Table 1: The various values of f3,r, R, X,u,w etc. that describe the minimizers ^ and d?]) for 
the simple example ()13p . The last column indicates the step that is taken in the algorithm of 
Sectional Values of (3, r, u oi w that can be found through interpolation (w.r.t R for f3 and r, 
w.r.t A for w and w.r.t. for u) have been replaced by vertical dots for lack of space. 




Figure 1: Minimizers corresponding to the example X and y of equation (jlSp . Left: parameters 
/3 as a function of i? = ||/3||i,oo- Right: the corresponding trade-off curve plotting the loss 
X^ILi Qriiy — Xf3)i) as a function of the penalty R = ||/3|[i^oo- The slope of this curve is equal 
to -A. 
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quantile regression. In view of Theorem 2 of [8j, and the loss function in the adapted BIC 
for quantile regression is as follows: 

BIC(i?) = log Sr ((y - - nn, (14) 

where /3 is a function of R and riR is defined as the number of zeros in the residual vector 
r = y — Xf3. The model with smaller BIC is more desirable (see e.g. [3]). 

In this section two main applications of structured sparse quantile regression are studied 
using real data sets as illustrations. 



5.1 Low birth weight data set 

According to {22] . low birth weight (LBW) is defined as a birth weight of a liveborn infant of 
less than 2500g regardless of gestational age. LBW has negative effect both on the infants and 
the parents, e.g mothers of LBW babies have a greater chance of having postpartum depression 
and they need more time before returning to work, infants who are born with LBW are at 
greater risk of having learning or vision difficulties. Also, LBW infants would impose large 
costs on society. The risks of LBW are discussed by many authors such as, [23], [H] and [25] . 
Therefore, determining the effective factors in LBW infants is very important. If one is interested 
in studying the effects of different factors on the lower tail of the conditional distribution of 
infants' weight, [13] have remarked that using the least squares regression methods (e.g. [11]) is 
not reasonable. Using quantile regression with r-th quantile (r < 0.5) would give the possibility 
to study lower tail of the infants weight given the explanatory variables. 

As in [llj, the data are taken from ^J. The data set contains the birth weight (expressed 
in grams) of 189 infants as the response variable and 8 explanatory variables: mother's age (in 
years), mother's weight (in pounds), mother's race ((1,0) = black, (0,1) = white or (0,0) = 
other), smoking status during pregnancy (1 =yes or =no), number of previous premature 
labours (0, 1,2,.. .), history of hypertension (1 = yes or = no), presence of uterine irritability 
(1 = yes or = no), number of physician visits during the first trimester (0, 1, 2, . . .). The data 
were collected at Baystate Medical Center, Springfield, Massachusetts, in the year 1986. 

As [13] and [11] have suggested, some non-linear effects of two of the quantitative predictors 
(mother's weight and age) may exist. Therefore, in accordance with |13j we consider a second- 
order polynomial for both of them. We put each corresponding pair of variables in one group. 
A variable which needs a non-singleton group is race (it is a nominal variable with more than 
two levels) . Thus in order to study its effect one may create two dummy variables out of it (see 
also Section [T]). As was already mentioned in the introduction, both or none of them should be 
included in the model. So we may put them both in one group. All other groups are singletons. 

As it is mentioned in [13], and considering Tukey's dictum: Never estimate intercepts, al- 
ways estimate centercepts, the quantitative variables in the model are centered and re-scaled by 
dividing by their standard deviations. Therefore, the estimated intercept may be interpreted as 
the weight of an infant born to a 23 year old mother, whose weight was 130 pounds, her race was 
other (neither black nor white), she was a non-smoker during her pregnancy, with on average 
0.16 previous premature labours, no history of hypertension, no presence of irritability, and an 
average of 0.47 physician visits during the first trimester. 

A small amount of jitter is added to the variables y and X, to guarantee that the one-at-a- 
time condition mentioned at the end of Section [His satisfied (see e.g. also ^ p438]). We have 
verified that this does not change the outcome of the numerical experiments. 
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x = 0.1 x = 0.5 x = 0.9 




Figure 2: Analysis of the LBW data set. The first row shows (part of) the solution path given 
by the algorithm in Section U] for r = 0.1,0.5,0.9. The vertical dashed line indicates the model 
chosen by the BIC (for R = Rbic)- The second row shows the corresponding coefficients /3 for 
the BIC model. 

Figure [2] (top) presents (part of) the solution path of the model coefficients $ for for r = 0.1, 
r = 0.5, and r = 0.9. A vertical dashed line is drawn at -R = -Rbic to indicate the coefficients 
chosen by the BIC (|14p . The bottom row of this figure contains the three corresponding model 
coefficients (chosen by BIC). As one may see, the models chosen for various values of r are 
different, i.e. the effective variables for the lower tail of the conditional distribution are different 
from the ones for the median and for the upper tail. Since we are interested in studying the 
effective variables on the LBW, using loss functions such as least-squares (with minimizer equal 
to the conditional mean) would only give partial information. This result is in accordance with 
|27j who remarked that the effects of the explanatory variables in LBW data set are not constant 
across the conditional distribution of the infant's weight. In other words, for different quantiles 
one may have different models. 

5.2 Simultaneous variable selection for a vector of response variables 

Consider Y = {yi,y2, ■ ■ ■ , Up) a vector of possibly correlated response variables. One is interested 
in modelling these variables using a common set of explanatory variables X = (xi, X2, ■ ■ ■ , Xm)- 
For variable selection purposes, one may consider p linear models yj = X/3 + e, (j = 1, . . . ,p) 
separately and find e.g. the penalized model solution for /3 for each model. But as [15j have 
suggested, it is sometimes interesting to select variables by considering all p response variables 
simultaneously, specially when these p variables are correlated. 

Suppose we have n observations yij (with i = 1, . . . ,n and j = 1, . . . ,p) for Y = {yi,y2, ■ ■ ■ , yp) 
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as response variables and X = (xi, X2, • • • , Xm) as explanatory variables. Let (A; = 1, . . . , m 
and j = 1,... ,p) be the regression coefficient of regressed on yj. As [15] have observed, 
\\{hi,h2, ■ ■ ■ ,/3fcp)||oo = max{|/3fci|, \Pk2\, ■ ■ ■ , |/3fcp|} is a reasonable measure of the explanatory- 
power of the regressor x^ on all p response variables yj simultaneously. If the least-squares loss 
of [lIS] is replaced by the more robust loss function ([3]), the following optimization problem: 

n p / m \ m 

/3 = argmin^^ I yij - "^Xikl^kj | + A ^ max{|/3fei |, |/3fe2|, • • • , \/3kp\}- (15) 
i=l j=l V k=l J k=l 

should be solved to select the variables. Equivalently, one could also use the constrained formu- 
lation taking the form: 

n p / m 

/3 = arg niin £'r J/ij - V Xifc^^j 

Er=imax{|/3fci[,[/3fci|,...,|&p|}<K^^ V ^ 

As the argument of Qr in these last two expressions is a linear function of the Pkj , problems (|15|) 
and ()16p are special cases of problems ([6]) and ([7]), and can therefore be solved by the algorithm 
of Section [H As [15] pointed out, this is an exploratory tool for identifying a suitable subset 
of regressor variables, not for actual parameter estimation. The problem ()15p has already been 
proposed by [16] who solved it for a fixed value of R using a generic linear programming code. 
The algorithm of Section S] finds the minimizer for a whole range of values of R. The latter 
algorithm is therefore more useful as the BIC criterion (jl4p (also used by [16j) requires a further 
minimization over many values of the parameter R. 

As an example, we consider the '93CARS' data set which contains information on 93 new 
cars for the 1993 model year which is obtained from [1^. Table [2] presents the variables we have 
considered. Some of the observations have been omitted due to missing values, so in total 82 
observations are used. 

Figure [3] shows the pairwise scatter plots of the 5 response variables. The two main points 
which follow from this figure are: 1. the variables are correlated, and 2. there are some outliers 
in the data. Therefore, a simultaneous variables selection using least absolute deviation (QR 
with r = 0.5) seems reasonable here. Both response variables y and regression variables X are 
standardized, so one would be able to compare variables in different measures. 

The algorithm of Section [5] is used with r = 0.5. In this application too, a small amount of 
jitter is added to the variables y, so as to guarantee that the one-at-a-time condition mentioned 
at the end of section H] is satisfied (see e.g. also [3 p438]). We have again verified that this 
does not change the outcome of the numerical experiments. Figured] (left) shows the presence 
or absence of each group (as a functions of i? = ||/3||i^oo)- 

The BIC (fnj) is used to select a model among all possible models calculated along the 
solution path. The corresponding value of ||/3||i^oo is called i?BiC) and is indicated on the first 
panel with a dotted line. The final model is given in Figured] (right). As one may see, the 
variables X2,X8,xii and xi2 were not selected. 

6 Conclusions 

A structured sparse solution (or group sparse solution) of a quantile regression model, based 
on penalizing or constraining the quantile regression loss function by a mixed £i^oo-norm of 
regression coefficients, was discussed. 
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Figure 3: Pairwise scatter plots of the 5 response variables in the 93CARS data set. The 
histogram of each response variable y is presented on the diagonal. 




123456789 1011121314 123456789 1011121314 



Explanatory variables index Explanatory variables index (k) 

Figure 4: Structured sparse quantile regression of the 93CARS data set. Left: the £oo-norm of 
coefficients /3's in each group (vertical) as a function of R (horizontal). Right: (top) presence or 
absence of each group along the solution path, (bottom) the ^oo-norm of coefficients /3 in each 
group for R = .Rbic (selected using the BIC ([HI)). 
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Table 2: Variables in 93CARS data set. 





Variable 


Description 




Minimum price (in $1,000) 


Price for basic version of this model 


Vo 


Midrange price (in $1,000) 


Average of Min and Max prices 


uo 


Maximum price (in $1,000) 


Price for a premium version 




City MPG 


miles per gallon by EPA rating 




Highway MPG 






Number of cylinders 




To 


T^(rip"iTip sizp 


in liters 


X'i 
•^o 


Horsepower 


maximum 


X4 


RPM 


revs per minute at maximum horsepower 


^0 


Engine revolutions per mile 


(in highest gear) 




Fuel tank capacity 


in gallons 


X7 


Passenger capacity 


in persons 




Length 


in inches 


Xg 


Wheelbase 


in inches 


a;io 


Width 


in inches 




U-turn space 


in feet 


X12 


Rear seat room 


in inches 




Luggage capacity 


in cubic feet 


Xi4 


Weight 


in pounds 



An algorithm to compute the solution of the corresponding minimization problem was pre- 
sented. This algorithm computes the minimizer of the penalized or constrained loss function for 
all values of i? = ||/S||i^oo within a range [0, i?max] instead of just for a single value of R. This 
is a strong point when using the BIC criterion (which needs a further minimization over R) for 
model selection. 

In a first application, the effective variables for the lower and upper quantiles of the condi- 
tional distribution of the birth weight of infants in the LBW data set were identified, subject 
to a group sparsity constraint. As a second application we studied the problem of simultane- 
ous variable selection in a quantile regression model for robustly modeling a vector of possibly 
correlated response variables using a common set of explanatory variables. 

The implementation of the algorithm presented in Section H] is not straightforward. There- 
fore, such an implementation in Matlab is provided on the authors' webpage [12], together with 
the scripts for processing the LBW data set and the 93CARS data set. The necessary func- 
tions for interpolating the solution (3 between the nodes and choosing the best model using the 
proposed BIC are also provided. 

The current short article dealt with the non-overlapping group case. An potential extension 
of the algorithm would consist of including the overlapping group case as well. 
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